Using a Random Forest proximity measure for variable importance stratification in genotypic data
Abstract
In this work we study variable significance in classification using the Random Forest proximity matrix and local importance matrix. We use the proximity matrix to group the samples into a number of clusters and use these clusters to stratify the importance of a variable. We apply this approach to a cardiovascular genotype dataset for sample classification based on coronary heart disease, and we found a number of variations associated with cardiovascular disease phenotypes. We also used a set of phenotypes related to this genotype data to match the obtained clusters with coronary heart disease phenotypes.
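The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of Random Forest proximity (the fraction of trees in which two samples land in the same leaf) and assumes that "stratifying" importance means averaging each variable's local importance within each cluster; the abstract does not specify either detail, nor the clustering method, so cluster labels are supplied by hand here. All function names and the toy data are hypothetical. A leaf-index matrix of this shape could, for instance, come from scikit-learn's `RandomForestClassifier.apply`.

```python
from collections import defaultdict


def proximity_matrix(leaves):
    """leaves[i][t] is the leaf index that sample i reaches in tree t.

    Returns an n x n matrix whose (i, j) entry is the fraction of trees
    in which samples i and j fall in the same leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def stratified_importance(local_importance, clusters):
    """Average each variable's local importance within each cluster.

    local_importance[i][v] is the importance of variable v for sample i;
    clusters[i] is the cluster label of sample i (assumed given).
    """
    grouped = defaultdict(list)
    for row, label in zip(local_importance, clusters):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


# Toy data: 4 samples, 3 trees (hypothetical leaf indices).
leaves = [[1, 4, 2], [1, 4, 3], [5, 6, 7], [5, 6, 7]]
prox = proximity_matrix(leaves)

# Hypothetical cluster labels and local importances for 2 variables.
clusters = [0, 0, 1, 1]
local_imp = [[0.5, 0.0], [0.25, 0.0], [0.0, 1.0], [0.0, 0.5]]
by_cluster = stratified_importance(local_imp, clusters)
```

In a real analysis the cluster labels would be derived from the proximity matrix itself, for example by clustering on the dissimilarity 1 - proximity; that step is left out of the sketch because the abstract does not say which clustering algorithm was used.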
Similar resources
Gene Selection Using Random Forest and Proximity Differences Criterion on DNA Microarray Data
Selection of relevant genes for sample classification is a common task in most gene expression studies. As a powerful classification approach, random forest has been applied in this field, and it shows excellent performance compared with other classification methods. The measure of variable importance is the key of gene selection using random forest. However, the existing methods just consider ...
Random Forest Visualization
Classification is the process of assigning a class label to an observation based on its properties or attributes. A classification algorithm is applied to a data set, producing a model. By studying the model, insights about the data set structure can be gained. The benefits that a model can bring depend on the model. In this work, a Random Forest model is used for the analysis of data. A Rando...
Letter to the Editor: On the stability and ranking of predictors from random forest variable importance measures
A recent study examined the stability of rankings from random forests using two variable importance measures (mean decrease accuracy (MDA) and mean decrease Gini (MDG)) and concluded that rankings based on the MDG were more robust than MDA. However, studies examining data-specific characteristics on ranking stability have been few. Rankings based on the MDG measure showed sensitivity to within-...
A Random Forest proximity matrix as a new measure for gene annotation
In this paper we present a new score for gene annotation. This new score is based on the proximity matrix obtained from a trained Random Forest (RF) model. As an example application, we built this model using the association p-values of genotype with blood phenotype as input and the association of genotype data with coronary heart disease as output. This new score has been validated by comparing...
Classification of large datasets using Random Forest Algorithm in various applications: Survey
Random Forest is an ensemble classification algorithm widely used in many applications, especially with larger datasets, because of its outstanding features such as the Variable Importance measure, OOB error detection, proximity among the features, and handling of imbalanced datasets. This paper discusses many applications that use Random Forest to classify datasets, such as network intrusion detection, ...